COVID-19 has been dominating our thoughts, our lives, and the news for months now. As this deadly pandemic ravages the world, news outlets have been reporting that racial disparities have deadly implications for African Americans. Reports suggest an overrepresentation of infections, hospitalizations, and deaths among African Americans compared to their white counterparts. This is unsurprising for countless reasons, but I wanted to dig into the data for myself. There are many different ways to approach this analysis, but for simplicity's sake, I use data reporting COVID-related deaths by county and match that to 2010 US census data reporting racial demographics by county. Here, I show data demonstrating that majority black communities are being disproportionately affected by COVID-19.
#pip install chart_studio
#pip install "notebook>=5.3"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
import seaborn as sns
import chart_studio
import chart_studio.plotly as py
import plotly
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
url = "https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_deaths_usafacts.csv"
df_us_deaths = pd.read_csv(url)
Link to data: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
The data I will be using to show COVID-19 death rates comes from USAFacts.org. USAFacts lists cumulative deaths in each county in each state of the US starting 1/22/20, and includes state and county FIPS. These codes will come in handy later for merging dataframes. USAFacts also has separate data sets for confirmed cases and population adjustments. I will be using confirmed deaths as a metric for severity, and conducting my own county-based population adjustments.
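Because the counts are cumulative, the last date column is each county's running total, and the day-to-day difference gives new deaths. Here is a minimal sketch of that idea on made-up numbers (the county names, codes, and counts are invented for illustration and are not taken from the USAFacts file; `toy` and `toy_long` are names I picked for the example):

```python
import pandas as pd

# Toy stand-in for the USAFacts layout: one row per county, one column per date.
# All values are invented for illustration.
toy = pd.DataFrame({
    "countyFIPS": [1001, 1003],
    "County Name": ["County A", "County B"],
    "State": ["AL", "AL"],
    "stateFIPS": [1, 1],
    "1/22/20": [0, 0],
    "1/23/20": [1, 0],
    "1/24/20": [3, 2],
})

# Wide -> long: one row per county per date.
toy_long = toy.melt(
    id_vars=["countyFIPS", "County Name", "State", "stateFIPS"],
    var_name="Date",
    value_name="Deaths",
)
toy_long["Date"] = pd.to_datetime(toy_long["Date"], format="%m/%d/%y")
toy_long = toy_long.sort_values(["countyFIPS", "Date"]).reset_index(drop=True)

# Cumulative -> daily: difference within each county. The first day has no
# predecessor, so fall back to the cumulative value itself.
toy_long["New deaths"] = (
    toy_long.groupby("countyFIPS")["Deaths"].diff().fillna(toy_long["Deaths"])
)
print(toy_long[["countyFIPS", "Date", "Deaths", "New deaths"]])
```

The daily values for each county add back up to the final cumulative count, which is the number used as "deaths to date" in the rest of this analysis.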
df_us_deaths.head() #check out the data!
df_us_deaths.shape # There are 3195 counties in the US, including unallocated territories
df_us_deaths.isnull().values.any() # yayyy we have all our data!!
#Format column names
#probably going to want dates in a consistent, zero-padded format
df_labels = df_us_deaths.iloc[:, :4]
df_dates = df_us_deaths.iloc[:, 4:].copy()
#re-write each date header from m/d/yy to mm-dd-YYYY
df_dates.columns = [dt.strptime(name, "%m/%d/%y").strftime("%m-%d-%Y") for name in df_dates.columns]
df_us_deaths = pd.concat([df_labels, df_dates], axis = 1)
#also don't want that space in the "County Name" column
df_us_deaths = df_us_deaths.rename(columns = {"County Name":"County"})
#get rid of county info (for now)
df_state_deaths = df_us_deaths.iloc[:, 2:]
#group state rows together
df_state_deaths = df_state_deaths.drop(columns = ["stateFIPS"])
df_deaths_by_state = df_state_deaths.groupby(["State"], as_index=False).agg("sum")
#reshape data frame
df_deaths_by_state = pd.melt(df_deaths_by_state, id_vars = "State").rename(columns = {"variable": "Date", "value": "Deaths"})
#rows with all zeros (no deaths) aren't very informative for us...
df_deaths_by_state = df_deaths_by_state.loc[df_deaths_by_state["Deaths"] != 0, :]
# plot!!
plot = px.line(df_deaths_by_state,
x='Date',
y='Deaths',
color='State',
title = "Deaths by state over time",
width = 1000,
height = 700)
plotly.offline.iplot(plot)
The plots I use here are interactive through plotly. Double clicking on an item in the legend or on a line in the chart will isolate it so you can view only that data. Then, single clicking on additional states will add them to the chart for comparison. Hovering over a specific data point will give you the cumulative deaths up to that specific date in that specific state.
# Extract the total number of deaths to date per county
df_us_deaths['County,State'] = df_us_deaths[['County', 'State']].agg(', '.join, axis=1)
df_county_deaths = df_us_deaths.drop(columns = ["State", "stateFIPS", "County"])
df_county_totals = df_county_deaths.iloc[:, [0,-1,-2]]
df_county_totals = df_county_totals.rename(columns = {df_county_totals.columns[-1]:"Deaths"})
#also, there are a lot of zeros so I'll take these out for visualization purposes...
df_county_totals_deaths = df_county_totals.loc[df_county_totals["Deaths"] != 0, :]
# plot!!
hist = px.histogram(df_county_totals_deaths,
x="Deaths",
nbins = 100,
log_y = True,
title = "Distribution of deaths by county",
marginal = "rug",
hover_name = "County,State",
hover_data = ["Deaths"],
width = 1000,
height = 600,
)
plotly.offline.iplot(hist)
This histogram shows the distribution of counties based on COVID-related deaths. This is interesting as a snapshot, but what we really want to look at with this analysis is the racial demographics of those harder-hit counties. Time to bring in the census data...
#Get census data
al_mo_url = "https://www2.census.gov/programs-surveys/popest/datasets/2010/modified-race-data-2010/stco-mr2010_al_mo.csv"
df_al_mo = pd.read_csv(al_mo_url)
mt_wy_url = "https://www2.census.gov/programs-surveys/popest/datasets/2010/modified-race-data-2010/stco-mr2010_mt_wy.csv"
df_mt_wy = pd.read_csv(mt_wy_url, encoding = 'latin-1')
df_census = pd.concat([df_al_mo, df_mt_wy], ignore_index=True)
The data I will be using to determine racial demographics comes from the United States Census Bureau. Of note, this data comes from the last nationwide census in 2010. Subsequent Census Bureau surveys have only included areas above a certain population threshold (65,000 people), which may exclude areas of interest from this analysis. Thus, until 2020 census data is publicly available, 2010 will have to do.
This data set includes information about sex, Hispanic origin, age group, and race for each county by FIPS. I am interested in looking at racial demographics, but a lot of other cool analyses could be performed with this information.
#for this project, I'm interested in race by region
df_census = df_census.drop(columns = ["SUMLEV", "SEX", "AGEGRP"])
#Gotta fix these column names too
dict_names = {"STATE":"stateFIPS",
"COUNTY":"countyFIPS",
"STNAME":"State",
"CTYNAME":"County",
"ORIGIN":"Hispanic",
"IMPRACE":"Race",
"RESPOP":"Num_res"}
df_census = df_census.rename(columns = dict_names)
#The census countyFIPS are in a different format than the USAFacts countyFIPS :(
#USAFacts uses the state FIPS followed by the zero-padded 3-digit county FIPS (e.g. state 1, county 1 -> 1001)
df_census["stateFIPS"] = df_census["stateFIPS"].astype(int)
df_census["countyFIPS"] = df_census["stateFIPS"] * 1000 + df_census["countyFIPS"].astype(int)
#df_census.head(10)
df_census["Num_res"].sum() #about 300 million people in the US as of 2010... math checks out
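The FIPS matching above comes down to one convention: a full county code is the state FIPS followed by the county code zero-padded to three digits. A small sketch of that rule, where the helper name `full_fips` is my own and not part of either data set:

```python
def full_fips(state_fips: int, county_fips: int) -> int:
    """Combine a state FIPS and a within-state county FIPS into one integer code."""
    # Zero-pad the county part to three digits, then prepend the state part.
    return int(f"{state_fips}{county_fips:03d}")


# Stored as an integer, the code loses any leading zero of the state part,
# which is also how the countyFIPS column is handled in this notebook.
print(full_fips(1, 1))
print(full_fips(6, 37))
print(full_fips(48, 201))
```

Since the county part never exceeds three digits, this is the same as computing `state_fips * 1000 + county_fips`.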
There are 31 different race categories in the US census data, most of which are mixed. It would be difficult to categorize and find meaningful data based on all of these categories, so I'm going to determine which categories comprise the majority of the population and run the analysis based on these categories.
Also, census information separates Hispanic origin from race. I'm going to add all residents who identify as having Hispanic origin to a separate race, and not include these people in the racial group they had initially chosen. (i.e. "Hispanic white" --> "Hispanic", "non-Hispanic white" --> "white")
#Create separate race category for everyone who identifies as Hispanic (Race "0")
#if the person identifies as Hispanic, add them to Race 0
df_census.loc[df_census.Hispanic == 2, "Race"] = 0
#Look at a bar chart of how prevalent these races are to determine what to include in analysis
#sum only the resident counts so the text columns (State, County) are not aggregated
df_pop_by_race = df_census.groupby("Race", as_index=False)["Num_res"].sum().sort_values(by = "Num_res", ascending = False)
pop = px.bar(df_pop_by_race,
x='Race',
y='Num_res',
title = "Racial makeup of the US",
width = 1000,
height = 600)
pop.update_layout(yaxis_title_text = 'Number of residents',
xaxis_type = 'category')
plotly.offline.iplot(pop)
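Based on the above plot, races 1, 0, 2, 4, 3, 6, 8, & 7 make up the overwhelming majority of the US population, so this analysis focuses on those categories, where 0 = Hispanic (any race), 1 = White, 2 = Black, 3 = American Indian, 4 = Asian, 6 = White and Black, 7 = White and American Indian, and 8 = White and Asian: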
races = [1, 0, 2, 4, 3, 6, 8, 7]
df_census = df_census[df_census["Race"].isin(races)].drop(columns = ["Hispanic"])
To further condense our list, I'm combining the biracial categories with the corresponding non-white race. By all accounts, these people still experience racism. As such, our list will consist of just five categories: Hispanic, White, Black, American Indian, and Asian.
df_census.loc[df_census.Race == 6, "Race"] = 2
df_census.loc[df_census.Race == 7, "Race"] = 3
df_census.loc[df_census.Race == 8, "Race"] = 4
#Replace race number indicator with actual race
df_census["Race"] = df_census["Race"].replace({0: "Hispanic",
1: "White",
2: "Black",
3: "American Indian",
4: "Asian"})
#Look at US demographics based on these major groups
#sum only the resident counts so the text columns (State, County) are not aggregated
df_pop_by_race1 = df_census.groupby("Race", as_index=False)["Num_res"].sum().sort_values(by = "Num_res", ascending = False)
df_pop_by_race1["Percent of total population"] = ((df_pop_by_race1["Num_res"]/df_pop_by_race1["Num_res"].sum())*100).round(2)
pop1 = px.bar(df_pop_by_race1,
x='Race',
y='Percent of total population',
title = "US Demographics",
width = 1000,
height = 600
)
pop1.update_layout(yaxis_title_text = 'Percent of total population')
plotly.offline.iplot(pop1)
Here you have the US racial demographics as of 2010 based on the top 5 most prevalent race categories. These are the races that will be included in our analysis.
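To make the recoding steps concrete, here is a small sketch of the same sequence on invented rows. It assumes, as the code above does, that a `Hispanic` value of 2 means Hispanic origin; the resident counts and the `toy`, `labels`, and `summary` names are made up for the example.

```python
import pandas as pd

# Invented rows in the shape used above: Hispanic (1 = no, 2 = yes),
# Race (numeric census race code) and Num_res (resident count).
toy = pd.DataFrame({
    "Hispanic": [1, 2, 1, 1, 1, 2],
    "Race":     [1, 1, 2, 6, 4, 2],
    "Num_res":  [500, 120, 200, 30, 100, 50],
})

# 1. Anyone of Hispanic origin goes into the separate category 0.
toy.loc[toy["Hispanic"] == 2, "Race"] = 0

# 2. Fold the biracial codes into the corresponding non-white category.
toy["Race"] = toy["Race"].replace({6: 2, 7: 3, 8: 4})

# 3. Swap the numeric codes for readable labels.
labels = {0: "Hispanic", 1: "White", 2: "Black", 3: "American Indian", 4: "Asian"}
toy["Race"] = toy["Race"].map(labels)

# 4. Residents and share of the total per category.
summary = toy.groupby("Race", as_index=False)["Num_res"].sum()
summary["Percent"] = (summary["Num_res"] / summary["Num_res"].sum() * 100).round(2)
print(summary.sort_values("Num_res", ascending=False))
```

The order matters: the Hispanic reassignment runs first, so a resident counted as both Hispanic and Black ends up in the Hispanic category only.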
# sum residents of each race by county
df_census_by_region = df_census.groupby(["stateFIPS", "countyFIPS", "Race"], as_index = False).agg({"Num_res":"sum"})
# Merge residents by race of each region with COVID deaths of each region
df_region_race_deaths = pd.merge(df_census_by_region, df_county_totals, on = ["countyFIPS"])
#Add percent race by county as a column
dfx = df_region_race_deaths.groupby(["countyFIPS"], as_index = False).agg({"Num_res":"sum"}).rename(columns = {"Num_res":"total_res"})
df_percents_by_county = pd.merge(df_region_race_deaths, dfx, on = "countyFIPS")
df_percents_by_county.loc[:, "percent_race"] = ((df_percents_by_county["Num_res"]/df_percents_by_county["total_res"])*100).round(2)
#Add percent death by county as a column
df_percents_by_county.loc[:, "percent_death"] = ((df_percents_by_county["Deaths"]/df_percents_by_county["total_res"])*100).round(5)
df_percents_by_county = df_percents_by_county[["County,State",'total_res','Deaths',"percent_death", 'Race','Num_res', 'percent_race']]
#pd.set_option('display.max_rows', None)
df_percents_by_county
Now we have a data frame that gives us information regarding the total number of residents by race and the total number of deaths due to COVID-19, as well as the percentages of each normalized to respective county population.
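As a sanity check on what those two percentage columns mean, here is a sketch on two invented counties. It uses `groupby(...).transform("sum")` in place of the groupby-then-merge step above, which should give the same per-county total in one line; all numbers are made up.

```python
import pandas as pd

# Invented counties: residents by race plus one death total per county.
toy = pd.DataFrame({
    "County,State": ["A, XX", "A, XX", "B, XX", "B, XX"],
    "Race":         ["White", "Black", "White", "Black"],
    "Num_res":      [8000, 2000, 3000, 7000],
    "Deaths":       [5, 5, 20, 20],  # county-level total, repeated on each race row
})

# County population = sum of residents over all race rows in that county.
toy["total_res"] = toy.groupby("County,State")["Num_res"].transform("sum")

# Share of the county's population in each race category.
toy["percent_race"] = (toy["Num_res"] / toy["total_res"] * 100).round(2)

# Share of the county's whole population that died; identical on every race row.
toy["percent_death"] = (toy["Deaths"] / toy["total_res"] * 100).round(5)
print(toy)
```

Note that `percent_death` is a county-wide rate: the death data is not broken out by race, so every race row of a county carries the same value.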
#We can subset any county in the US and look at its demographics and COVID death rate
df_dc = df_percents_by_county.loc[df_percents_by_county["County,State"]=="Washington, DC", :]
df_dc
# order counties by decreasing percent death, identified by the majority race
df_majority_counties = df_percents_by_county.loc[df_percents_by_county.groupby("County,State")["percent_race"].idxmax()].drop(columns = ["total_res", "Deaths", "Num_res"]).sort_values("percent_death", ascending = False)
df_majority_counties = df_majority_counties.loc[df_majority_counties["percent_race"] > 50, :]
# pull counties with highest percent death
df_top_counties = df_majority_counties.head(10)
#pull top counties by race
df_tc_wh = df_top_counties.loc[df_top_counties["Race"] == "White",:]
df_tc_co = df_top_counties.loc[df_top_counties["Race"] != "White",:]
#plot
df_tc = pd.DataFrame({"Race": ["Majority white", "Majority POC"],
"# counties in the top 10 counties by death rate":[df_tc_wh.shape[0], df_tc_co.shape[0]]})
bar = px.bar(df_tc,
y = "Race",
x = "# counties in the top 10 counties by death rate",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "1a. Number of counties in the top 10",
width = 800,
height = 400,
orientation = "h")
bar.update_traces(hovertemplate = None, hoverinfo='skip')
bar.update_yaxes(title = " ")
plotly.offline.iplot(bar)
This doesn't look too remarkable: among the 10 US counties with the highest death rates, there are 4 majority white and 6 majority non-white counties. However, let's consider the fact that there are far more majority white counties in the US...
#calculate majority counties (again) and divide by race
df_maj_wh = df_majority_counties.loc[df_majority_counties["Race"] == "White",:]
df_maj_co = df_majority_counties.loc[df_majority_counties["Race"] != "White",:]
#plot
df_tc = pd.DataFrame({"Race": ["Majority white", "Majority POC"],
"Number of counties in the US":[df_maj_wh.shape[0], df_maj_co.shape[0]]})
bar1 = px.bar(df_tc,
y = "Race",
x = "Number of counties in the US",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "1b. Number of counties in the US according to majority race",
hover_data = ["Number of counties in the US"],
width = 800,
height = 400,
orientation = "h")
bar1.update_yaxes(title = " ")
plotly.offline.iplot(bar1)
Here we can see that US counties are heavily skewed toward majority white: there are over 10 times as many majority white counties in the US as there are majority non-white counties. So now let's take another look at the makeup of our top 10 counties by death rate, considering the skew in overall county demographics...
#calculate percentage of total counties
percent_white = (df_tc_wh.shape[0]/df_maj_wh.shape[0])*100
percent_poc = (df_tc_co.shape[0]/df_maj_co.shape[0])*100
#plot
df_tc = pd.DataFrame({"Race": ["Majority white", "Majority POC"],
"% counties in top 10 COVID death rates":[percent_white, percent_poc]})
bar2 = px.bar(df_tc,
y = "Race",
x = "% counties in top 10 COVID death rates",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "1c. Percent of counties within race in the top 10 highest death rates",
hover_data = ["% counties in top 10 COVID death rates"],
labels = {"% counties in top 10 COVID death rates":"% counties"},
width = 800,
height = 400,
orientation = "h")
bar2.update_yaxes(title = " ")
bar2.update_xaxes(tickprefix = "%")
plotly.offline.iplot(bar2)
Now, we are looking at the same information as plot 1a (makeup of the top 10 counties by death rate), but normalized to the total number of counties in each group. This accounts for the heavy skew towards majority white counties. As of 5/03, 2.82% of US counties with a majority non-white population are among the 10 counties with the highest COVID-19 death rates. Only 0.14% of majority white counties are in this list.
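The normalization is easier to see on a tiny example. The sketch below uses twelve invented counties and a top 2 instead of a top 10; the `Majority` labels, the rates, and the variable names are all made up for illustration.

```python
import pandas as pd

# Invented counties, one row each, already labelled with their majority group.
toy = pd.DataFrame({
    "County":        list("ABCDEFGHIJKL"),
    "Majority":      ["White"] * 10 + ["POC"] * 2,
    "percent_death": [0.30, 0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.02, 0.01, 0.02,
                      0.25, 0.04],
})

top_n = 2  # the analysis above uses 10; 2 keeps the toy example readable
top = toy.nlargest(top_n, "percent_death")

in_top = top["Majority"].value_counts()    # raw counts in the top N
totals = toy["Majority"].value_counts()    # how many counties each group has
share = (in_top / totals * 100).fillna(0)  # percent of each group's counties in the top N
print(in_top)
print(share)
```

In this toy data each group places one county in the top 2, so the raw counts look even, but that single county is 10% of the majority white counties and 50% of the majority POC counties.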
# find the percent race based on white and black for each county
race = df_percents_by_county["Race"]
df_race = df_percents_by_county.loc[(race == "White") | (race == "Black"), :]
#plot
px.defaults.width = 800
px.defaults.height = 800
scat = px.scatter(df_race,
x = "percent_race",
y = "percent_death",
size = "percent_death",
hover_data = ["County,State", "percent_race", "percent_death"],
facet_row = "Race",
color = "percent_death",
color_continuous_scale=px.colors.sequential.Burgyl,
width = 1000
)
scat.update_yaxes(tickprefix = "%")
scat.update_layout(xaxis_title_text = 'Percent county pop that identifies as respective race',
title = "2. Death rate by racial makeup of county"
)
plotly.offline.iplot(scat)
The above plot demonstrates the relationship between the percentage of the county population that is either white or black, and the percentage of the county population that died due to COVID-19. Big bubbles in the top right quadrant of either plot represent counties with a high percentage of that particular race as well as a high number of deaths relative to county population size. The plot representing people who identify as Black or African-American has several of these markers, indicating higher death rates in majority Black counties, while the plot representing people who identify as non-Hispanic white does not.
Next, we focus in on counties where the majority of the population identifies as either Black/African American or white. We can determine the majority race of a county by defining it as greater than 50% for that county. The rest of this analysis will focus on "majority black" and "majority white" counties in this way.
# Returns the rows for counties where the given race (by name, e.g. "Black") is the majority, defined as > 50%
def majority_race(race):
    r = df_percents_by_county["percent_race"]
    df_majority = df_percents_by_county.loc[r > 50, :]
    df_majority = df_majority.loc[df_majority["Race"] == race, :]
    return df_majority
#create data frames based on majority race
df_black_majority = majority_race("Black")
df_white_majority = majority_race("White")
#df_hisp_majority = majority_race("Hispanic")
#df_asian_majority = majority_race("Asian")
#df_native_majority = majority_race("American Indian")
# combine majority white and majority black data frames
df_race_majority = pd.concat([df_white_majority, df_black_majority])
# plot
px.defaults.width = 1000
px.defaults.height = 600
hist2 = px.histogram(df_race_majority,
x="percent_death",
nbins = 20,
log_y = True,
title = "3. Distribution of deaths by racial makeup of county",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
opacity = 0.6,
barmode = "overlay",
histnorm = "percent",
hover_name = "County,State",
hover_data = ["percent_death", "percent_race"],
labels={'percent_death':'percent death', "percent_race":"percent race"},
marginal = "box"
)
hist2.update_layout(xaxis_title_text = 'Percent county pop that died due to COVID-19',
yaxis_title_text = "percentage of counties")
plotly.offline.iplot(hist2)
Shown above are the distributions of the percentage of the county population that died due to COVID-19 for both black majority and white majority counties. There are 2,807 counties in the US that are majority white but only 102 that are majority black, so the y-axis has been standardized to percent. Majority black counties have death rates skewed further right than majority white counties (i.e. more counties have a greater percentage of COVID-related deaths).
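To see why the y-axis is standardized, here is a rough sketch of a percent-normalized histogram on invented death rates for one large and one small group of counties. The helper `percent_hist` is a name I made up, and it only approximates what plotly does with `histnorm = "percent"`.

```python
import numpy as np

# Invented death rates (percent of county population) for two groups of very
# different size.
large_group = np.array([0.01] * 90 + [0.06] * 10)  # 100 counties
small_group = np.array([0.01] * 6 + [0.06] * 4)    # 10 counties

bins = np.array([0.0, 0.05, 0.10])  # two bins: below and above 0.05%


def percent_hist(values, bins):
    """Histogram whose bars are percent of the group's counties, not raw counts."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum() * 100


print(percent_hist(large_group, bins))
print(percent_hist(small_group, bins))
```

In raw counts the large group has more counties in the upper bin (10 versus 4), yet that is 10% of its counties compared with 40% of the small group's, which is the comparison the plot above is meant to show.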
#percent of black counties with a death rate over .1%
df_b = df_black_majority.loc[df_black_majority["percent_death"] > .1, :]
percent_b = round(((df_b.shape[0]/df_black_majority.shape[0])*100), 4)
#percent of white counties with a death rate over .1%
df_w = df_white_majority.loc[df_white_majority["percent_death"] > .1, :]
percent_w = round(((df_w.shape[0]/df_white_majority.shape[0])*100), 4)
df_bad_counties = pd.DataFrame({"Race": ["Majority white", "Majority black"],
"%counties death rate > 0.1%":[percent_w, percent_b]})
#plot and compare
px.defaults.width = 800
px.defaults.height = 400
bar3 = px.bar(df_bad_counties,
y = "Race",
x = "%counties death rate > 0.1%",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "4. Percent of counties with a death rate > 0.1%",
hover_data = ["%counties death rate > 0.1%"],
labels={'%counties death rate > 0.1%':'% counties'},
orientation = "h")
bar3.update_yaxes(title = " ")
bar3.update_xaxes(tickprefix = "%")
plotly.offline.iplot(bar3)
As of 5/04, almost 5% of counties in the US with a majority black population have had more than 0.1% of their population die due to COVID-19. The same can be said for only 0.32% of majority white counties.
#majority white counties with at least 1 death due to COVID
df_white_majority_death = df_white_majority.loc[df_white_majority["Deaths"] > 0, :].sort_values("percent_death", ascending = False)
#majority black counties with at least 1 death due to COVID
df_black_majority_death = df_black_majority.loc[df_black_majority["Deaths"] > 0, :].sort_values("percent_death", ascending = False)
#plot and compare
percent_death_white = (df_white_majority_death.shape[0]/df_white_majority.shape[0])*100
percent_death_black = (df_black_majority_death.shape[0]/df_black_majority.shape[0])*100
df_county_rates = pd.DataFrame({"Race": ["Majority white", "Majority black"],
"Percent of counties with death":[percent_death_white, percent_death_black]})
bar4 = px.bar(df_county_rates,
y = "Race",
x = "Percent of counties with death",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "5. Percent of counties with at least one COVID-related death",
hover_data = ["Percent of counties with death"],
labels={'Percent of counties with death':'% counties'},
orientation = "h")
bar4.update_yaxes(title = " ")
bar4.update_xaxes(tickprefix = "%")
plotly.offline.iplot(bar4)
As of 5/04, 81.37% of counties in the US with a majority black population have had at least one death due to COVID-19. The same can be said for only 46.45% of majority white counties. In other words, majority black counties are about 1.75 times as likely to have recorded at least one death due to COVID-19.
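The comparison between the two groups can be written out as simple arithmetic on the two percentages quoted above (the variable names are mine):

```python
# Figures quoted above: percent of counties with at least one COVID-19 death.
pct_black_majority = 81.37
pct_white_majority = 46.45

ratio = pct_black_majority / pct_white_majority       # relative share
difference = pct_black_majority - pct_white_majority  # gap in percentage points

print(round(ratio, 2))
print(round(difference, 2))
```

The ratio comes out at roughly 1.75, a gap of about 35 percentage points between the two groups of counties.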
Though over 2000 counties in the US have a population comprised mostly of people who identify as non-Hispanic white, counties with majority non-white populations, specifically Black or African-American, are being disproportionately affected by COVID-19. Overall death rates are higher, and they top the lists of counties with the highest COVID death rates. These data support what news sources are reporting regarding the issue. Of note, death rates are likely also impacted by socioeconomic status, healthcare access, and population density (people per square mile), though these issues are also linked to racial disparities.